validation warnings set stacklevel=2 #377
Conversation
**Codecov Report**

```diff
@@           Coverage Diff           @@
##             main     #377      +/-   ##
==========================================
+ Coverage   88.31%   88.65%   +0.34%
==========================================
  Files          19       19
  Lines        2875     2971      +96
==========================================
+ Hits         2539     2634      +95
- Misses        336      337       +1
```
... of course, @craffel already suggested this 3 years ago 😆
Thanks. What does this look like when you call a metric function directly?

Regarding redundant calls to validate, this probably belongs in the other issue, but there must be some Python wizardry that basically says: if the same function is called with the same arguments within this context, skip the call. Basically some kind of argument caching/memoization, but instead of caching the output, just skip the call completely, and evict the cache when you exit the context. That would work, right?
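(For concreteness, here is a minimal sketch of that idea. All names here, `skip_when_seen`, `skip_repeats`, `_seen`, are hypothetical, not part of mir_eval:)

```python
import contextlib
import functools

# None means "not inside the context": calls always go through.
_seen = None


def skip_when_seen(func):
    """Skip calls that repeat an earlier call inside a skip_repeats
    context, rather than caching and returning their output."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if _seen is not None:
            # numpy arrays are not hashable, so key on object identity:
            # "same arguments" here means the same objects were passed.
            key = (func.__qualname__, tuple(id(a) for a in args),
                   tuple((k, id(v)) for k, v in sorted(kwargs.items())))
            if key in _seen:
                return None  # skip the call entirely
            _seen.add(key)
        return func(*args, **kwargs)
    return wrapper


@contextlib.contextmanager
def skip_repeats():
    global _seen
    _seen = set()
    try:
        yield
    finally:
        _seen = None  # evict the cache when the context exits


@skip_when_seen
def validate(reference, estimated):
    print("validating")


ref, est = [1.0, 2.0], []
with skip_repeats():
    validate(ref, est)  # runs
    validate(ref, est)  # skipped: same objects, same context
validate(ref, est)      # runs again outside the context
```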
Took the liberty of adding the bypass parameter throughout. In all cases, the default behavior is unchanged: everything is validated by default. However, we now avoid doing multiple validation calls on the same data.

I also took the liberty of bumping the chord parser regexp up into a module-level compiled pattern. Recompiling that monster regexp for every chord label was probably the source of the inefficiency I mentioned back in #341.

I'll report back when all safety parameterizations are done.
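(For illustration, the module-level pattern hoisting looks like this; the regex below is a toy stand-in, not mir_eval's actual chord grammar:)

```python
import re

# Compiled once at import time, instead of re-running the compilation
# machinery for every chord label parsed.
_CHORD_RE = re.compile(r"^(?P<root>[A-G][#b]*)(?::(?P<quality>[a-z0-9]+))?$")


def split_chord(label):
    match = _CHORD_RE.match(label)
    if match is None:
        raise ValueError(f"Invalid chord label: {label!r}")
    return match.group("root"), match.group("quality")


print(split_chord("C#:maj7"))  # ('C#', 'maj7')
```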
Hah, sorry, crossed wires here. 😁

The idea is that if the user calls `evaluate`, validation happens once up front and the metric functions are told to skip it. If the user calls a metric directly with the defaults, validation runs as usual. If the user calls either with validation disabled, it is skipped entirely.

Regarding memoization, yeah, we could do that, but I don't think it actually buys us much here. It would add at least one dependency (e.g. joblib) and probably complicate the stacklevel logic more than the solution above.
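(A minimal sketch of that design, using a hypothetical `skip_validation` flag; the actual parameter name in the PR may differ:)

```python
import warnings


def validate(reference_beats, estimated_beats):
    # Toy validator standing in for the real ones.
    if len(estimated_beats) == 0:
        warnings.warn("Estimated beats are empty.", stacklevel=2)


def f_measure(reference_beats, estimated_beats, skip_validation=False):
    # `skip_validation` is an illustrative name for the bypass flag.
    if not skip_validation:
        validate(reference_beats, estimated_beats)
    return 0.0  # placeholder for the actual metric computation


def evaluate(reference_beats, estimated_beats):
    # Validate once up front, then tell each metric to skip it.
    validate(reference_beats, estimated_beats)
    return {"F-measure": f_measure(reference_beats, estimated_beats,
                                   skip_validation=True)}
```

Either way, `validate` ends up being called exactly one frame below the user-facing entry point, so a single stacklevel value serves both paths.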
This works, but doesn't it seem drastic to put a new arg into basically all of the functions just to avoid redundant calls in `evaluate`? It seems possibly worth doing the Python wizardry to avoid adding an arg everywhere.
I wouldn't consider it drastic if it's implemented consistently across the entire package (which is my plan). It's perhaps a little clumsy, but it's easy to understand and use.

A slick pythonic alternative would be to bundle everything into a decorator that abstracts out the validation/bypass logic before getting to the core of the evaluation code. This might be a little fancier on the dev side, but the user experience would be pretty much the same. In this case, as much as I like a slick decorator trick, I'd favor readability and transparency.
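(A sketch of what that decorator alternative could look like; `validated` is a hypothetical name, not an API in mir_eval:)

```python
import functools
import warnings


def validated(validator):
    # Bundle the validation/bypass logic around each metric.
    def decorator(metric):
        @functools.wraps(metric)
        def wrapper(reference, estimated, *, skip_validation=False, **kwargs):
            if not skip_validation:
                validator(reference, estimated)
            return metric(reference, estimated, **kwargs)
        return wrapper
    return decorator


def validate(reference_beats, estimated_beats):
    if len(estimated_beats) == 0:
        # Note: the decorator's wrapper adds a stack frame, which any
        # choice of stacklevel here would have to account for.
        warnings.warn("Estimated beats are empty.", stacklevel=3)


@validated(validate)
def f_measure(reference_beats, estimated_beats):
    return 0.0  # placeholder metric body
```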
... forgot to comment that I'd finished with this retrofit, and I think it's ready for CR.

Thinking more about potential Python wizardry, I think the only reason to prefer that over the explicit parameterization I've implemented here is if we think we might change things later on. That kind of thing happens more for deprecation warnings / parameter renames / etc., which can change frequently over the lifecycle of a project, so there's a benefit to making them easy to add or remove at any point. I don't think that applies in this case.

I do think we need validation bypassing as implemented here if we want all of the following:

- correct warning stacklevels whether a metric is called directly or through `evaluate`
- no redundant validation of the same data inside `evaluate`
- unchanged default behavior
The only remaining question for me right now is whether the stacklevel should be 2 or 3. The call path looks something like: user code → `evaluate` (optionally) → metric function → `validate` → `warnings.warn`.
The main branch currently leaves it at 1, which is not helpful. This PR currently uses 2, which would point to either `evaluate` or the specific metric being called; this seems marginally more helpful. I think 3 might be best, as it points to where the user is entering the mir_eval code path. The only downside here is that if a user is calling `validate()` directly, then the stacklevel might point one level higher up than it should. I expect this case is exceedingly rare and not something we should worry about, but I did want to note it here.
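(For reference, a standalone sketch of how the stacklevel value maps onto that call path, not mir_eval code:)

```python
import warnings


def validate():
    # stacklevel=1 -> this line, inside validate (current main behavior)
    # stacklevel=2 -> validate's caller: the metric, or evaluate
    # stacklevel=3 -> the caller's caller: normally the user's own code
    warnings.warn("Estimated beats are empty.", stacklevel=3)


def f_measure():
    validate()


f_measure()  # with stacklevel=3, the warning is attributed to this line
```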
Sorry for the delay in responding; as I mentioned, my latency is a bit unpredictable these days.

I appreciate the work you put into this, but I don't think adding an additional argument to most user-facing functions in the library is the right solution. It hurts usability/readability significantly, in addition to raising potential backwards-compatibility issues with positional arguments, and therefore should be a last resort. I also don't think it's important to support users disabling validation: the validation logic is there to catch problems that would otherwise cause issues, and to throw more helpful errors/warnings; the only reason it's in a separate function is to avoid code duplication in all of the metric functions. If there was no other option, I think it would make sense to consider adding an arg, but in this case the issue it solves is minor enough (repeated warnings for malformed inputs when evaluate is called) that I'm not even sure it's warranted. In any case, I don't think we need to address this issue or come up with a solution for 0.8.0, so I would be most inclined to punt on it for now.

Regarding the stack level, I'm not sure, since IME the stack level reported by warnings is almost never helpful... but I agree that validate should almost never be called by an end-user, so (3) is probably best.
I don't think this affects positional arguments in existing code, as it's always the last new (positional) parameter. I also don't see how this hurts usability or readability: the default behavior is preserved, and the new behavior is explicitly stated in the documentation. At most it's one extra piece of information in the docs?
I'm okay with punting if you think it's really unimportant.
I actually do use warning stacklevels somewhat often, so if we have a chance to do it right, I think we should. That said, I don't think we can do it right without the change I've implemented here, since there are two alternative routes to the validate function with different stacklevels (user → metric → validate and user → evaluate → metric → validate). This is why I went to the effort of lifting out the validate call and adding bypass options at all. I agree with your earlier point that disabling validation entirely is of negligible utility, but I don't think there's another good option here that still produces a correct stacklevel. (A grungier alternative would be to pass the target stacklevel through to validate, but that's much less generally readable or usable than a bypass option.)
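(A sketch of that grungier alternative, threading the target stacklevel through every layer; the `stacklevel` parameters below are hypothetical, not mir_eval API:)

```python
import warnings


def validate(reference_beats, estimated_beats, stacklevel=2):
    if len(estimated_beats) == 0:
        # +1 accounts for validate's own frame
        warnings.warn("Estimated beats are empty.", stacklevel=stacklevel + 1)


def f_measure(reference_beats, estimated_beats, stacklevel=2):
    # A direct call warns at the user's line; evaluate must bump the
    # level so the warning skips its own frame too.
    validate(reference_beats, estimated_beats, stacklevel=stacklevel)
    return 0.0


def evaluate(reference_beats, estimated_beats):
    return {"F-measure": f_measure(reference_beats, estimated_beats,
                                   stacklevel=3)}
```

Every layer has to agree on the frame bookkeeping, which is what makes this approach grungy.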
Adding an argument that is almost never used to almost all user-facing functions harms readability and usability. I'm not saying it's catastrophic, I'm just saying it should be a last resort: if it can be avoided (which I believe it can), it should be avoided.

Regarding the stack level, thinking about it more, I think stacklevel 2 is reasonable: it flags for the user that the warning is coming from some particular metric function. If that's inside an evaluate call, that's OK, because ultimately all evaluate does is some preprocessing before calling the metric functions. I don't think we can in general assume that some point in the stack corresponds to the point where the user constructed and passed off the data being evaluated, because one level above the evaluate/metric function call could be either user code or some other library the user is using on top of mir_eval.
Ok, I'll close this out then.
Just to clarify, I wasn't saying to nuke this entirely - I still think stacklevel=2 is a QoL improvement over the current behavior and is a simple change that should be merged in.
Yeah, I understand. But as I've said above, with stacklevel=2 the warning gets attributed to a different line for each metric:

```
In [9]: mir_eval.beat.evaluate(ref_data, np.array([]))
/home/bmcfee/git/mir_eval/mir_eval/beat.py:165: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:207: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:267: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:359: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:454: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:614: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
```

Since these are coming from different lines in the module (due to stacklevel), they appear to be different warnings, and are thus not absorbed by the default warning filter (which only suppresses repeats from the same location). Compare the current (stacklevel=1) behavior, where the repeated warning collapses into a single report:

```
In [6]: mir_eval.beat.evaluate(ref_data, np.array([]))
/home/bmcfee/git/mir_eval/mir_eval/beat.py:93: UserWarning: Estimated beats are empty.
  warnings.warn("Estimated beats are empty.")
```

Now, Python 3.12 actually added some functionality to do this in exactly the right way with the `skip_file_prefixes` argument to `warnings.warn`.

Which is all to say: I don't think setting stacklevel=2 universally is correct or more helpful than the current behavior, and I also don't see a cleaner way to achieve the most helpful behavior than what I'd implemented in this PR.
Ah, I misunderstood the issue. I'll work on a solution to avoid repeat calls to validate after my leave ends.
Fixes #372

All validation functions now set stacklevel=2. Before, the reported line number comes from `validate`; after, the line number comes from the metrics.

As it currently stands, we do a lot of redundant validation when using the `evaluate()` functions. It might be worth considering an enhancement where `evaluate()` validates the data once at the top, and then each underlying metric provides an option to bypass validation thereafter. I've proposed similar things in the past (#341) as an optimization hack, but in this case it's now also a usability issue, because the relevant stacklevel to warn at differs when metrics are called directly (2) vs. called from `evaluate` (3).

I think it could make sense to roll that into this PR, or not. Curious for other folks' input.
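(The core change described above, as a simplified sketch; the real validators check many more conditions:)

```python
import warnings


def validate(reference_beats, estimated_beats):
    if len(estimated_beats) == 0:
        # Previously: warnings.warn("Estimated beats are empty.")
        # which attributes the warning to this line, inside validate.
        # With stacklevel=2, it is attributed to the metric's call site:
        warnings.warn("Estimated beats are empty.", stacklevel=2)
```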